Abstract: Top-$k$ queries allow end-users to focus on the most important (top-$k$) answers amongst those which satisfy the query. In traditional databases, a user-defined score function assigns a score value to each tuple, and a top-$k$ query returns the $k$ tuples with the highest scores. In an uncertain database, the top-$k$ answer depends not only on the scores but also on the membership probabilities of tuples. Several top-$k$ definitions covering different aspects of the score-probability interplay have been proposed in the recent past [10], [4], [2], [8]. Most of the existing work in this research field focuses on developing efficient algorithms for answering top-$k$ queries on static uncertain data. Any change (insertion or deletion of a tuple, or a change in the membership probability or score of a tuple) in the underlying data forces re-computation of query answers. Such re-computations are not practical considering the dynamic nature of data in many applications. In this paper, we propose a fully dynamic data structure that uses the ranking function $\mathrm{PRF}^e(\alpha)$ proposed by Li et al. [8] under the generally adopted model of x-relations [11]. $\mathrm{PRF}^e$ can effectively approximate various other top-$k$ definitions on uncertain data based on the value of the parameter $\alpha$. An x-relation consists of a number of x-tuples, where an x-tuple is a set of mutually exclusive tuples (up to a constant number) called alternatives. Each x-tuple in a relation randomly instantiates into one tuple from its alternatives. For an uncertain relation with $N$ tuples, our structure can answer top-$k$ queries in $O(k \log N)$ time, handle an update in $O(\log N)$ time, and takes $O(N)$ space. Finally, we evaluate the practical efficiency of our structure on both synthetic and real data.


I. Introduction 

The efficient processing of uncertain data is an important 
issue in many application domains because of the imprecise 
nature of data they generate. The nature of uncertainty in data 
is quite varied, and often depends on the application domain. 
In response to this need, much effort has been devoted to
modeling uncertain data HT], O, Q, Q, Q. Most models
have been adapted to possible world semantics, where an
uncertain relation is viewed as a set of possible instances 
(worlds) and correlation among the tuples governs generation 
of these worlds. 

Consider the traffic monitoring application data [10] (with modified probabilities) shown in Table I, where radar is used to detect car speeds. In this application, data is inherently uncertain because of reading errors introduced by nearby high-voltage lines, interference from nearby cars, human operator error, etc. If two radars at different locations detect the presence of the same car within a short time interval, as with tuples $t_2$ and $t_4$ as well as $t_3$ and $t_6$, then at most one radar reading can be correct. We use the x-relation model to capture such correlations. An x-tuple $\tau$ specifies a set of exclusive tuples, subject to the constraint $\sum_{t_i \in \tau} \Pr(t_i) \le 1$. The fact that $t_2$ and $t_4$ cannot be true at the same time is captured by the x-tuple $\tau_1 = \{t_2, t_4\}$. Similarly, $\tau_2 = \{t_3, t_6\}$. The probability of a possible world is computed from the existence probabilities of the tuples present in the world and the absence probabilities of the tuples in the database that are not part of it. For example, consider the possible world $pw = \{t_1, t_2, t_3\}$. Its probability is computed by assuming the existence of $t_1$, $t_2$, $t_3$ and the absence of $t_4$, $t_5$, and $t_6$. However, since $t_2$ and $t_4$ are mutually exclusive, the presence of $t_2$ implies the absence of $t_4$, and the same applies to $t_3$ and $t_6$. Therefore, only the absence of $t_5$ contributes an extra factor, and $\Pr(pw) = 0.3 \times 0.4 \times 0.2 \times (1 - 0.3) = 0.0168$.

TABLE I
TRAFFIC MONITORING DATA: $t_1$, $\{t_2, t_4\}$, $\{t_3, t_6\}$, $t_5$

Time   Car Loc  Plate No  Speed  Prob  Tuple Id
11:55  L1       Y-245     130    0.30  t1
11:40  L2       X-123     120    0.40  t2
12:05  L3       Z-541     110    0.20  t3
12:15  L4       X-123     105    0.50  t4
12:10  L5       L-UO      95     0.30  t5
11:35  L6       Z-541     80     0.45  t6
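The possible-world probability for the Table I data can be checked with a short script. This is only an illustrative sketch: the names `PROB`, `X_TUPLES` and `world_probability` are hypothetical and not from the paper, and the x-tuple grouping is taken from the caption of Table I.

```python
# Membership probabilities from Table I.
PROB = {"t1": 0.30, "t2": 0.40, "t3": 0.20, "t4": 0.50, "t5": 0.30, "t6": 0.45}

# x-tuples: each set lists mutually exclusive alternatives.
X_TUPLES = [{"t1"}, {"t2", "t4"}, {"t3", "t6"}, {"t5"}]


def world_probability(world, prob=PROB, x_tuples=X_TUPLES):
    """Probability of a possible world, given as a set of tuple ids."""
    result = 1.0
    for tau in x_tuples:
        present = tau & world
        if len(present) > 1:
            # Two alternatives of the same x-tuple can never co-exist.
            return 0.0
        if present:
            (t,) = present
            result *= prob[t]
        else:
            # The x-tuple produces no tuple at all in this world.
            result *= 1.0 - sum(prob[t] for t in tau)
    return result


print(round(world_probability({"t1", "t2", "t3"}), 4))
```

Each x-tuple contributes exactly one factor: the probability of the chosen alternative, or the probability that none of its alternatives appears.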

Top-$k$ queries on a traditional certain database have been well studied. In such cases, each tuple is associated with a single score value assigned to it by a scoring function. There is a clear total ordering among tuples based on score, from which the top-$k$ tuples can be retrieved. However, for answering a top-$k$ query on uncertain data, we have to take into account both the ordering based on scores and the ordering based on existence probabilities of tuples. Depending on how these two orderings are combined, various top-$k$ definitions with different semantics have been proposed in recent times. Most of the existing work studies only the problem of answering a top-$k$ query on static uncertain data. Though the query time of an algorithm depends on the choice of top-$k$ definition, a linear scan of the tuples achieves the best bound so far. Therefore, recomputing top-$k$ answers in an application with frequent insertions and deletions can be extremely inefficient. In this paper, we present a fully dynamic structure of size $O(N)$ that always maintains the correct answer to the top-$k$ query for an uncertain database. The structure is based on a decomposition of the problem so that updates can be handled efficiently. Our structure can answer a top-$k$ query in $O(k \log N)$ time and handle an update in $O(\log N)$ time.

Outline: In Section II, we review the different top-$k$ definitions proposed so far and compare them against the parameterized ranking function $\mathrm{PRF}^e(\alpha)$ proposed by Li et al. [8]. We choose $\mathrm{PRF}^e(\alpha)$ over other definitions as it can approximate many of the other top-$k$ definitions and can handle data updates efficiently. After formally defining the problem (Section III), we explain how $\mathrm{PRF}^e(\alpha)$ can be computed using a divide-and-conquer approach (Section IV), which forms the basis of our data structure explained in Section V. We present an experimental study with real and synthetic data sets in Section VI. Finally, we review the related work in Section VII before concluding the paper.

II. Top-$k$ Queries on Uncertain Data

Soliman et al. [10] first considered the problem of ranking tuples when there is both a score and a probability for each tuple. Several other definitions of ranking have since been proposed for probabilistic data.

• Uncertain Top-$k$ (U-Topk) [10]: It returns the $k$-tuple set that appears as the top-$k$ answer in possible worlds with maximum probability.

• Uncertain Rank-$k$ (U-Ranks) [10]: It returns a tuple for each rank $i$, such that it has the maximum probability of appearing at rank $i$ across all possible worlds.

• Probabilistic Threshold Query (PT-k) [4]: It returns all the tuples whose probability of appearing in the top-$k$ is greater than a user-specified threshold.

• Expected Rank (E-Rank) [2]: The $k$ tuples with the best (lowest) expected rank $er(t_i)$ are returned, where

$$er(t_i) = \sum_{pw} \Pr(pw) \, rank_{pw}(t_i)$$

and $rank_{pw}(t_i)$ denotes the rank of $t_i$ in a possible world $pw$. In case $t_i$ does not appear in the possible world, $rank_{pw}(t_i)$ is defined as $|pw|$.

• Expected Score (E-Score) [2]: The $k$ tuples with the highest value of expected score $es(t_i)$ are returned, where

$$es(t_i) = \Pr(t_i) \, score(t_i)$$

• Parameterized Ranking Function (PRF) [8]: PRF in its most general form is defined as

$$\Upsilon(t_i) = \sum_{r} \omega(t_i, r) \times \Pr(t_i, r) \tag{1}$$

where $\omega$ is the weight function that maps a given tuple-rank pair to a complex number and $\Pr(t_i, r)$ denotes the probability of a tuple $t_i$ being ranked at position $r$ across all possible worlds. A top-$k$ query returns the $k$ tuples with the highest $\Upsilon$ values. Different weight functions can be plugged into the above definition to get a range of ranking functions, subsuming most of the top-$k$ definitions listed above. A special ranking function $\mathrm{PRF}^e(\alpha)$ is obtained by choosing $\omega(t_i, r) = \alpha^{r-1}$, where $\alpha$ is a constant. The experimental study in [8] reveals that for some value of $\alpha$ with the constraint $\alpha < 1$, $\mathrm{PRF}^e$ can approximate many existing top-$k$ definitions.

Algorithms for computing top-$k$ answers using the above ranking functions have been studied for static data. Any change in the underlying data forces re-computation of query answers. To understand the impact of a change on top-$k$ answers, we analyze the relative ordering of the tuples before and after a change, based on these ranking functions.

Let $T = t_1, t_2, \ldots, t_N$ denote independent tuples sorted in non-increasing order of their score. We choose insertion of a tuple as a representative case for changes in $T$, and monitor its impact on the relative ordering of a pair of tuples $(t_i, t_j)$. Since the E-Score of a tuple depends only on its score and existence probability, the ordering is preserved for all $(t_i, t_j)$ pairs in $T$. For the ranking functions U-Ranks and PT-k, the ordering of a pair $(t_i, t_j)$ may or may not be preserved by an insertion; in particular, it cannot be guaranteed when the score of the new tuple is higher than that of both $t_i$ and $t_j$. Hence, existing top-$k$ answers do not provide any useful information for re-computation of query answers. E-Rank further complicates the matter, as the expected rank of a tuple depends on both higher- and lower-scored tuples. However, when tuples are ranked using $\mathrm{PRF}^e(\alpha)$, the scope of disturbance in the relative ordering of tuples is limited, as explained in later sections. This enables efficient handling of updates in the database. Therefore, this ranking function is well suited for answering top-$k$ queries on a dynamic collection of tuples.

III. Problem Statement

Given an uncertain relation $T$ consisting of a dynamic collection of tuples, such that each tuple $t_i \in T$ is associated with a membership probability value $\Pr(t_i) > 0$ and a score $score(t_i)$ computed by a scoring function, the goal is to retrieve the top-$k$ tuples.

We use the parameterized ranking function $\mathrm{PRF}^e(\alpha)$ proposed in [8]. $\mathrm{PRF}^e(\alpha)$ is defined as

$$\Upsilon(t_i) = \sum_{r} \alpha^{r-1} \times \Pr(t_i, r) \tag{2}$$

where $\alpha$ is a constant and $\Pr(t_i, r)$ denotes the probability of a tuple $t_i$ being ranked at position $r$ across all possible worlds.* A top-$k$ query returns the $k$ tuples with the highest $\Upsilon$ values. We refer to $\Upsilon(t_i)$ as the rank-score of tuple $t_i$. In this paper, we adopt the x-relation model to capture correlations. An x-tuple $\tau$ specifies a set of exclusive tuples, subject to the constraint $\sum_{t_i \in \tau} \Pr(t_i) \le 1$. In a randomly instantiated world, $\tau$ takes $t_i$ with probability $\Pr(t_i)$, for $i = 1, 2, \ldots, |\tau|$, or does not appear at all with probability $1 - \sum_{t_i \in \tau} \Pr(t_i)$. Here $|\tau|$ represents the number of tuples belonging to the set $\tau$. Let $\tau(t_i)$ denote the x-tuple to which tuple $t_i$ belongs. In the x-relation model, $T$ can be thought of as a collection of pairwise-disjoint x-tuples. Let $\sum_{\tau \in T} |\tau| = N$, i.e., there are $N$ tuples in total in the uncertain relation $T$. Without loss of generality, we assume all scores to be unique and let $t_1, t_2, \ldots, t_N$ denote the ordering of the tuples in $T$ when sorted in descending order of score ($score(t_i) > score(t_{i+1})$). From now on, we write $p_i$ as shorthand for $\Pr(t_i)$.

* $\Pr(t_i, r) = 0$ for $r > i$.
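The rank-score in equation (2) can be evaluated directly from its definition by enumerating all possible worlds of an x-relation. The following is a minimal brute-force sketch, exponential in the number of x-tuples and useful only as a reference on tiny inputs. The function name, the 0-based indexing and the example relation are assumptions made for illustration.

```python
from itertools import product


def brute_force_rank_scores(probs, x_tuples, alpha):
    """Rank-score of every tuple by enumerating possible worlds.

    probs[i] is the membership probability of the tuple with the (i+1)-th
    highest score; x_tuples is a list of lists of 0-based indices that
    partitions range(len(probs)).
    """
    scores = [0.0] * len(probs)
    # Every x-tuple picks one alternative, or None when it is absent.
    choices = [list(tau) + [None] for tau in x_tuples]
    for pick in product(*choices):
        pr = 1.0
        for tau, chosen in zip(x_tuples, pick):
            if chosen is None:
                pr *= 1.0 - sum(probs[i] for i in tau)
            else:
                pr *= probs[chosen]
        # Ascending index order equals descending score order.
        world = sorted(i for i in pick if i is not None)
        for rank, i in enumerate(world, start=1):
            scores[i] += alpha ** (rank - 1) * pr
    return scores


# Hypothetical relation: four tuples, the 2nd and 3rd mutually exclusive.
print(brute_force_rank_scores([0.35, 0.3, 0.4, 0.45], [[0], [1, 2], [3]], 0.8))
```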

IV. Computing $\mathrm{PRF}^e(\alpha)$

In this section, we derive a closed-form expression for the rank-score $\Upsilon(t_i)$, followed by an algorithm for retrieving the top-1 tuple from a collection of independent tuples. In the next section we show that this approach can be easily extended to a data structure for efficiently retrieving the top-$k$ tuples from a dynamic collection of tuples. We begin by assuming tuple independence and then consider correlated tuples, where correlations are represented using x-tuples.

A. Assuming tuple independence: 

When all tuples are independent, tuple $t_i$ appears at position $r$ in a possible world $pw$ if and only if exactly $(r - 1)$ tuples with a higher score value appear in $pw$. Let $S_{i,r}$ be the probability that a randomly generated world from $\{t_1, t_2, \ldots, t_i\}$ has exactly $r$ tuples. Then the probability of tuple $t_i$ being ranked at position $r$ is given as

$$\Pr(t_i, r) = p_i S_{i-1,r-1}$$

where

$$S_{i,r} = \begin{cases} 1 & \text{if } i = r = 0, \\ p_i S_{i-1,r-1} + (1 - p_i) S_{i-1,r} & \text{if } i \ge 1 \text{ and } i \ge r \ge 0, \\ 0 & \text{otherwise.} \end{cases} \tag{3}$$

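Using the above recursion for $S_{i,r}$ and equations (2) and (3),

$$\frac{\Upsilon(t_i)}{p_i} = \sum_{r} \alpha^{r-1} S_{i-1,r-1} = \sum_{r} \alpha^{r} S_{i-1,r}.$$

Similarly,

$$\begin{aligned}
\frac{\Upsilon(t_{i+1})}{p_{i+1}} &= \sum_{r} \alpha^{r} S_{i,r} \\
&= \sum_{r} \alpha^{r} \left( p_i S_{i-1,r-1} + (1 - p_i) S_{i-1,r} \right) \\
&= \alpha p_i \sum_{r} \alpha^{r-1} S_{i-1,r-1} + (1 - p_i) \sum_{r} \alpha^{r} S_{i-1,r} \\
&= \left( 1 - (1 - \alpha) p_i \right) \frac{\Upsilon(t_i)}{p_i}.
\end{aligned}$$

We have the base case $\Upsilon(t_1) = p_1$. Therefore,

$$\Upsilon(t_i) = p_i \prod_{j<i} \left( 1 - (1 - \alpha) p_j \right). \tag{4}$$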

Now, we analyze the contribution of a tuple $t_i$ towards the global ranking over $T$ using the above formula.

• Tuple $t_i$ contributes $m_i = p_i$ to the computation of its own rank-score.

• Tuple $t_i$ contributes $c_i = 1 - (1 - \alpha) p_i$ to the computation of the rank-score of every tuple with a score lower than that of $t_i$.

Theorem 1: When all tuples in $T$ are independent, the rank-score of a tuple $t_i$ can be computed as

$$\Upsilon(t_i) = m_i \prod_{j<i} c_j$$

where $m_i = p_i$ and $c_j = 1 - (1 - \alpha) p_j$. □
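As a numerical cross-check of Theorem 1, the sketch below evaluates the rank-scores of independent tuples in two ways: through the recurrence (3) combined with the definition (2), and through the closed form (4). The function names are hypothetical, and tuples are assumed to be given in descending score order.

```python
def rank_scores_dp(p, alpha):
    """Rank-scores via the recurrence for S_{i,r} and equation (2)."""
    S = [1.0]  # S[r] = S_{0,r}: the empty prefix has zero tuples for sure
    out = []
    for p_i in p:
        # Upsilon(t_i) = p_i * sum_r alpha^r * S_{i-1,r}
        out.append(p_i * sum(alpha ** r * s for r, s in enumerate(S)))
        # Extend the distribution to include the current tuple.
        nxt = [0.0] * (len(S) + 1)
        for r, s in enumerate(S):
            nxt[r] += (1.0 - p_i) * s   # tuple absent
            nxt[r + 1] += p_i * s       # tuple present
        S = nxt
    return out


def rank_scores_closed_form(p, alpha):
    """Rank-scores via Upsilon(t_i) = m_i * prod_{j<i} c_j."""
    out, prefix = [], 1.0
    for p_i in p:
        out.append(p_i * prefix)              # m_i times product of earlier c_j
        prefix *= 1.0 - (1.0 - alpha) * p_i   # fold in c_i
    return out


probs = [0.3, 0.4, 0.2, 0.5]
print(rank_scores_dp(probs, 0.9))
print(rank_scores_closed_form(probs, 0.9))
```

The recurrence needs quadratic time overall, whereas the closed form needs a single pass, which is what makes the per-tuple contributions $m_i$ and $c_i$ attractive as building blocks.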



Answering Top-1 query:

We use a divide-and-conquer approach for answering a top-1 query on $T$, which forms the basis for our data structure in a later section. Let the given relation $T = \{t_1, t_2, \ldots, t_N\}$ be partitioned into sub-relations $T_l = \{t_1, t_2, \ldots, t_{\lceil N/2 \rceil}\}$ and $T_r = \{t_{\lceil N/2 \rceil + 1}, t_{\lceil N/2 \rceil + 2}, \ldots, t_N\}$. Also let $t^l$ and $t^r$ represent the top-1 answers for $T_l$ and $T_r$ with rank-scores $\Upsilon_{T_l}(t^l)$ and $\Upsilon_{T_r}(t^r)$ respectively, where $\Upsilon_{T_l}(t^l)$ is computed by considering only tuples $t_j \in T_l$ and $\Upsilon_{T_r}(t^r)$ is computed by considering only tuples $t_j \in T_r$.

For $t_i \in T_l$,

$$\Upsilon_{T_l}(t_i) = m_i \prod_{j<i,\ t_j \in T_l} c_j$$

and similarly for $t_i \in T_r$,

$$\Upsilon_{T_r}(t_i) = m_i \prod_{j<i,\ t_j \in T_r} c_j.$$

Now when both relations $T_l$ and $T_r$ are merged to form $T$, we make the following observations using the above analysis:

• The contribution of each tuple towards its own rank-score remains unchanged.

• Since all the tuples in $T_r$ have a lower score value than any tuple $t_i \in T_l$, they do not contribute towards the rank-score value of $t_i$ computed over the entire relation $T$. Thus $\Upsilon(t_i) = \Upsilon_{T_l}(t_i)$. Hence $t^l$ still has the highest rank-score value $\Upsilon(t^l)$ among the tuples in $T_l$.

• Since all the tuples in $T_l$ have a higher score value than any tuple $t_i \in T_r$, each $t_j \in T_l$ contributes $1 - (1 - \alpha) p_j$ towards the rank-score value of $t_i$ computed over the entire relation $T$. Let $C_l = \prod_{t_j \in T_l} c_j = \prod_{t_j \in T_l} \left( 1 - (1 - \alpha) p_j \right)$ represent the overall contribution of sub-relation $T_l$. Then $\Upsilon(t_i) = C_l \, \Upsilon_{T_r}(t_i)$. Since the rank-score value of every tuple $t_i \in T_r$ gets scaled by the same factor $C_l$, $t^r$ still has the highest rank-score value $\Upsilon(t^r)$ among the tuples in $T_r$.

Therefore the top-1 answer over the uncertain relation $T$ can be chosen from $t^l$ and $t^r$ based on their rank-score values computed over the entire relation.

B. Supporting correlations 

If $t_i$ has some preceding alternatives, then the event that $t_i$ appears is no longer independent of the event that exactly $r - 1$ tuples appear in $\{t_1, t_2, \ldots, t_{i-1}\}$, as assumed in equation (3). Hence equation (4) cannot be used to compute the rank-score of a tuple $t_i$. To overcome this difficulty, we convert the relation $T$ to $T^i$, where all the tuples are independent [12]. Let $\tau^i = \{t_j \mid t_j \in \tau, j < i\}$ and let $\Pr(\tau^i) = \sum_{t_j \in \tau^i} p_j$. Now for each x-tuple $\tau \in T$, we create an x-tuple $\tilde{\tau} = \{\tilde{t}\}$ in $T^i$, where $\Pr(\tilde{t}) = \Pr(\tau^i)$, with one exception: for the tuple $\tilde{t}_i \in T^i$ that corresponds to $\tau(t_i) \in T$, we use $\Pr(\tilde{t}_i) = p_i$, where $\tau(t_i)$ is the x-tuple to which the tuple $t_i$ belongs.

For example, let $T = \{\tau_1, \tau_2, \tau_3\}$ where $\tau_1 = \{t_1, t_3, t_6\}$, $\tau_2 = \{t_2, t_7\}$ and $\tau_3 = \{t_4, t_5\}$. Then $\tau_1^5 = \{t_1, t_3\}$ and $\tau(t_5) = \tau_3$.

This conversion takes into account the fact that only tuples with a score higher than that of $t_i$ contribute to $\Pr(t_i, r)$ as well as to $\Upsilon(t_i)$, and that the presence of $t_i$ implies the absence of all its related tuples.

Since all the tuples in $T^i$ are independent among themselves, we can now use equation (4) on $T^i$ to compute the rank-score of tuple $t_i$. Combining related tuples into a representative tuple $\tilde{t}$ does not affect $\Upsilon(t_i)$ here, since the probability that $\tilde{t}$ appears is the same as the probability that one tuple in $\tau \in T$ with a score higher than $score(t_i)$ appears. Therefore,

$$\Upsilon(t_i) = p_i \prod_{\tilde{t} \in T^i,\ \tilde{t} \ne \tilde{t}_i} \left( 1 - (1 - \alpha) \Pr(\tilde{t}) \right) = p_i \prod_{\tau \in T,\ \tau \ne \tau(t_i)} \left( 1 - (1 - \alpha) \Pr(\tau^i) \right). \tag{5}$$

Now, we analyze the contribution of an x-tuple towards the global ranking over $T$ using the above formula.

• x-tuple $\tau$ contributes $m_i = p_i$ to the rank-score of a tuple $t_i \in \tau$.

• x-tuple $\tau$ contributes $c_i = 1 - (1 - \alpha) \Pr(\tau^i)$ to the rank-score of a tuple $t_i \notin \tau$.

Answering Top-1 query:

Again, we attempt to use a divide-and-conquer algorithm for answering a top-1 query on $T$ by partitioning the relation $T = \{t_1, t_2, \ldots, t_N\}$ into sub-relations $T_l = \{t_1, t_2, \ldots, t_{\lceil N/2 \rceil}\}$ and $T_r = \{t_{\lceil N/2 \rceil + 1}, t_{\lceil N/2 \rceil + 2}, \ldots, t_N\}$ and assuming that $t^l$ and $t^r$ represent the top-1 answers for $T_l$ and $T_r$ respectively. If $t^l$ and $t^r$ remain the highest rank-score tuples in their respective sub-relations even after merging $T_l$ and $T_r$, then the top-1 for relation $T$ can be reported by simply comparing the rank-score values of $t^l$ and $t^r$ over the entire relation $T$. Unfortunately, this property may not hold for $T_r$.

To illustrate the problem, consider an uncertain relation $T = \{t_1, t_2, t_3, t_4\}$ with $p_1 = 0.35$, $p_2 = 0.3$, $p_3 = 0.4$, $p_4 = 0.45$, where tuples $t_2$ and $t_3$ are mutually exclusive. Using equation (5), the rank-scores can be computed as follows ($\alpha = 0.8$):

$\Upsilon(t_1) = 0.35$
$\Upsilon(t_2) = 0.3 (1 - 0.2 \times 0.35) = 0.28$
$\Upsilon(t_3) = 0.4 (1 - 0.2 \times 0.35) = 0.37$
$\Upsilon(t_4) = 0.45 (1 - 0.2 \times 0.35)(1 - 0.2 \times (0.3 + 0.4)) = 0.36$

A top-1 query on $T$ should return tuple $t_3$, which has the highest rank-score value 0.37. Adopting the divide-and-conquer approach, we partition the given relation into $T_l = \{t_1, t_2\}$ and $T_r = \{t_3, t_4\}$. The top-1 query is applied to these sub-relations as follows.

$\Upsilon_{T_l}(t_1) = 0.35$
$\Upsilon_{T_l}(t_2) = 0.3 (1 - 0.2 \times 0.35) = 0.28$
$\Upsilon_{T_r}(t_3) = 0.4$
$\Upsilon_{T_r}(t_4) = 0.45 (1 - 0.2 \times 0.4) = 0.41$

Thus $t_1$ and $t_4$ will be reported from $T_l$ and $T_r$ as top-1 answers respectively. A simple merge operation, which computes the rank-score values of $t_1$ and $t_4$ over relation $T$ and compares them, reports $t_4$ (0.36 versus 0.35) as the top-1 answer for $T$. However, the actual top-1 answer is tuple $t_3$. The fact that the dependence between $t_2$ and $t_3$ was ignored while answering the top-1 query over sub-relation $T_r$ is the root cause of the disturbance in the relative ordering of $t_3$ and $t_4$.

Therefore, in order to maintain the relative ordering of tuples based on their rank-score over the entire relation during a merge, we redefine the expressions for contributions as follows. We use the notation $\hat{p}_i$ for the sum of probabilities of all tuples $t_j$ that are related to $t_i$ and have a score greater than the score of $t_i$ (i.e., $j < i$). In the above example, $\hat{p}_3 = p_2 = 0.3$.

$$\hat{p}_i = \Pr([\tau(t_i)]^i) = \sum_{j<i,\ t_j \in \tau(t_i)} p_j$$

Now equation (5) can be rearranged as follows.

$$\Upsilon(t_i) = \frac{p_i}{1 - (1 - \alpha) \hat{p}_i} \prod_{\tau \in T} \left( 1 - (1 - \alpha) \Pr(\tau^i) \right)$$

$$\frac{\Upsilon(t_i)}{m_i} = \prod_{\tau \in T} \left( 1 - (1 - \alpha) \Pr(\tau^i) \right), \quad \text{where } m_i = \frac{p_i}{1 - (1 - \alpha) \hat{p}_i}.$$

Similarly,

$$\frac{\Upsilon(t_{i+1})}{m_{i+1}} = \prod_{\tau \in T} \left( 1 - (1 - \alpha) \Pr(\tau^{i+1}) \right).$$

Note that $\Pr(\tau^i) = \Pr(\tau^{i+1})$ for all $\tau \ne \tau(t_i)$. From the above two equations,

$$\left( \frac{\Upsilon(t_{i+1})}{m_{i+1}} \right) \Big/ \left( \frac{\Upsilon(t_i)}{m_i} \right) = \frac{1 - (1 - \alpha) \Pr([\tau(t_i)]^{i+1})}{1 - (1 - \alpha) \Pr([\tau(t_i)]^{i})} = \frac{1 - (1 - \alpha)(\hat{p}_i + p_i)}{1 - (1 - \alpha) \hat{p}_i} = c_i. \tag{6}$$

The base case is $\Upsilon(t_1) = p_1$. Therefore we can rewrite equation (5) in product form. The result is summarized in the following theorem.

Theorem 2: For an uncertain relation $T$, the rank-score of a tuple $t_i$ can be computed as

$$\Upsilon(t_i) = m_i \prod_{j<i} c_j$$

where

$$m_i = \frac{p_i}{1 - (1 - \alpha) \hat{p}_i}, \qquad c_i = \frac{1 - (1 - \alpha)(\hat{p}_i + p_i)}{1 - (1 - \alpha) \hat{p}_i}, \qquad \hat{p}_i = \sum_{r} p_r,$$

and the last sum ranges over all $r < i$ such that $t_i$ and $t_r$ are mutually exclusive.

This equation is applicable to dependent as well as independent tuples. Note that $m_i$ and $c_i$ depend only on the tuples that are related to $t_i$, and hence can be computed and updated efficiently. Moreover, the contribution $c_i$ of a tuple $t_i$ to the rank-score of a tuple $t_j$ is the same for all $j > i$. Hence, the relative ordering will not change even if we use our divide-and-conquer approach.

Consider the same example as before. We begin by computing the values of $m_i$ and $c_i$ for each tuple.

$m_1 = 0.35$
$m_2 = 0.3$
$m_3 = 0.4 / (1 - 0.2 \times 0.3) = 0.43$
$m_4 = 0.45$

$c_1 = 1 - 0.2 \times 0.35 = 0.93$
$c_2 = 1 - 0.2 \times 0.3 = 0.94$
$c_3 = (1 - 0.2 \times (0.3 + 0.4)) / (1 - 0.2 \times 0.3) = 0.91$
$c_4 = 1 - 0.2 \times 0.45 = 0.91$

Now, we partition $T$ into $T_l = \{t_1, t_2\}$ and $T_r = \{t_3, t_4\}$ and apply the top-1 query to these sub-relations.

$\Upsilon_{T_l}(t_1) = m_1 = 0.35$
$\Upsilon_{T_l}(t_2) = m_2 \times c_1 = 0.3 \times 0.93 = 0.28$
$\Upsilon_{T_r}(t_3) = m_3 = 0.43$
$\Upsilon_{T_r}(t_4) = m_4 \times c_3 = 0.45 \times 0.91 = 0.41$

It can be seen that $t_1$ and $t_3$ are chosen as the top-1 tuples from $T_l$ and $T_r$ respectively. During the next comparison, $t_3$ ($\Upsilon(t_3) = m_3 \times c_1 \times c_2 = 0.37$) will be reported as the top-1 tuple, which is correct.
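The contributions of Theorem 2 are straightforward to compute in a single pass over the tuples in descending score order. The sketch below is illustrative only: the function names and the `group` encoding (one x-tuple id per tuple) are assumptions. It reproduces the numbers of the four-tuple example with $\alpha = 0.8$.

```python
def contributions(p, group, alpha):
    """Per-tuple contributions (m_i, c_i) as defined in Theorem 2.

    p[i] is the probability of the tuple with the (i+1)-th highest score and
    group[i] identifies the x-tuple it belongs to.
    """
    higher_mass = {}  # x-tuple id -> p_hat, the mass of higher-scored alternatives
    m, c = [], []
    for p_i, g in zip(p, group):
        p_hat = higher_mass.get(g, 0.0)
        denom = 1.0 - (1.0 - alpha) * p_hat
        m.append(p_i / denom)
        c.append((1.0 - (1.0 - alpha) * (p_hat + p_i)) / denom)
        higher_mass[g] = p_hat + p_i
    return m, c


def rank_scores(p, group, alpha):
    """Upsilon(t_i) = m_i * prod_{j<i} c_j for every tuple."""
    m, c = contributions(p, group, alpha)
    out, prefix = [], 1.0
    for m_i, c_i in zip(m, c):
        out.append(m_i * prefix)
        prefix *= c_i
    return out


# Four-tuple example: t2 and t3 share an x-tuple.
print([round(s, 2) for s in rank_scores([0.35, 0.3, 0.4, 0.45], [0, 1, 1, 2], 0.8)])
```

For independent tuples `p_hat` is zero, so the formulas reduce to $m_i = p_i$ and $c_i = 1 - (1 - \alpha) p_i$ from Theorem 1.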



V. Our Data Structure: 

In the earlier sections, we derived a simple closed-form expression for calculating $\Upsilon(t_i)$ for a tuple $t_i$. Now our task is to maintain a dynamic collection of tuples such that, for a given query parameter $k$, we retrieve the top-$k$ rank-scored tuples efficiently. We take a data-structural approach to this problem. Our structure is a balanced binary search tree $A$ such that each leaf corresponds to a tuple in the uncertain relation $T$. Moreover, the leaves of the tree are sorted in decreasing order of score, i.e., leaves $\ell_1, \ell_2, \ldots, \ell_N$ of the tree represent tuples $t_1, t_2, \ldots, t_N$ in the same order from left to right, such that $score(t_i) > score(t_{i+1})$. Let $T_u$ represent the sub-relation containing the tuples associated with the leaves of the subtree rooted at node $u$, i.e., $T_u = \{t_{u'}, t_{u'+1}, \ldots, t_{u''}\}$, where $\ell_{u'}$ is the left-most and $\ell_{u''}$ the right-most leaf of node $u$. At each node $u$, we store a triplet $(top_u, M_u, C_u)$ such that:

• $top_u$ is the tuple (denoted $t_{u^*}$) with the highest rank-score among the tuples in sub-relation $T_u$.

• $M_u$ is the contribution of all tuples in $T_u$ towards the rank-score of tuple $top_u$:

$$M_u = m_{u^*} \prod_{u' \le i < u^*} c_i$$

• $C_u$ is the contribution of all tuples in $T_u$ towards any tuple $t_i$ such that $i > u''$, where $\ell_{u''}$ is the right-most leaf of the subtree rooted at node $u$:

$$C_u = \prod_{u' \le i \le u''} c_i$$

Since our data structure stores only a constant amount of information at each node, and the number of nodes is bounded by $O(N)$, the total space requirement of our data structure is $O(N)$.

If node $u$ is a leaf node representing the tuple $t_i$, then $M_u = m_i$, $top_u = t_i$ and $C_u = c_i$. If $u$ is an internal node, this information can be computed using the MERGE operation given below. Figure 1 shows an example for the uncertain data in Table I.

MERGE(u)
    v = left-child(u)
    w = right-child(u)
    M_u = max(M_v, C_v × M_w)
    top_u = top_v, if M_v > C_v × M_w
            else top_u = top_w
    C_u = C_v × C_w
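The MERGE step can be exercised on the leaf triplets of the running example. The sketch below is a hedged illustration: the `Node` type and `merge` function are hypothetical names, the leaf values are the rounded $m$ and $c$ values for $\alpha = 0.9$, and the grouping of leaves into subtrees is an assumption chosen so that each half covers three leaves.

```python
from typing import NamedTuple


class Node(NamedTuple):
    top: str   # id of the best tuple in the subtree
    M: float   # rank-score of `top` restricted to the subtree
    C: float   # product of c_i over all leaves of the subtree


def merge(v: Node, w: Node) -> Node:
    """Combine a left child v and a right child w into their parent."""
    scaled = v.C * w.M  # every tuple on the right is scaled by C_v
    if v.M > scaled:
        return Node(v.top, v.M, v.C * w.C)
    return Node(w.top, scaled, v.C * w.C)


leaves = [
    Node("t1", 0.300, 0.970), Node("t2", 0.400, 0.960), Node("t3", 0.200, 0.980),
    Node("t4", 0.521, 0.948), Node("t5", 0.300, 0.970), Node("t6", 0.459, 0.954),
]
left = merge(merge(leaves[0], leaves[1]), leaves[2])
right = merge(leaves[3], merge(leaves[4], leaves[5]))
root = merge(left, right)
print(left, right, root, sep="\n")
```

Because the scaling factor of a right subtree depends only on the product of the $c_i$ values to its left, the root triplet does not depend on how the leaves are grouped into subtrees.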



Theorem 3: The data structure $A$ maintains a dynamic collection of tuples such that the top-1 tuple $t^1 = top_{root}$ and $\Upsilon(t^1) = M_{root}$.

Proof (by contradiction): Let $t_a$ be the actual top-1 tuple and suppose $top_{root} \ne t_a$. Let $u$ be the node closest to the root such that $top_u = t_a$, which means $top_{parent(u)} = t_b \ne t_a$. This can only happen if, during the merge operation at $parent(u)$, $m_a \prod_{x \le i < a} c_i < m_b \prod_{x \le i < b} c_i$, where $\ell_x$ is the leftmost leaf of $parent(u)$. Multiplying both sides of this inequality by $\prod_{i<x} c_i$, we get $\Upsilon(t_a) < \Upsilon(t_b)$, which contradicts the assumption that $t_a$ is the highest rank-scored tuple. Therefore $t^1 (= t_a)$ will always be at the root and $M_{root} = m_a \prod_{1 \le i < a} c_i = \Upsilon(t_a) = \Upsilon(t^1)$. □

TABLE II
CALCULATION OF rank-scores (WITH $\alpha = 0.9$) OF TUPLES IN TABLE I

Tuple  Prob  m      c      Υ
t1     0.30  0.300  0.970  0.300
t2     0.40  0.400  0.960  0.388
t3     0.20  0.200  0.980  0.186
t4     0.50  0.521  0.948  0.475
t5     0.30  0.300  0.970  0.260
t6     0.45  0.459  0.954  0.385



[Figure 1: tree of $(top_u, M_u, C_u)$ triplets. Node labels recovered from the figure: (t2, 0.388, 0.913), (t4, 0.52, 0.877), (t6, 0.459, 0.954), (t1, 0.3, 0.97), (t2, 0.4, 0.96), (t4, 0.52, 0.948), (t5, 0.3, 0.97).]

Fig. 1. The data structure for the uncertain database in Table I

In the following subsections, we show how to perform the operations update-leaf, insert-leaf and delete-leaf on this tree. Later, we use these operations for retrieving the top-$k$ tuples and for inserting and deleting tuples.

A. Update-leaf

The values $m_i$ and $c_i$ within a leaf node $\ell_i$ can be changed in constant time. But this changes the $M$ and $C$ values at all nodes on the path from $\ell_i$ to the root. Therefore we need to perform the MERGE operation on all nodes on the path from $\ell_i$ to the root, starting from $parent(\ell_i)$. Since the height of a balanced binary tree is bounded by $O(\log N)$, the total time for update-leaf is also bounded by $O(\log N)$.

Theorem 4: The $m_i$ and $c_i$ values of a leaf can be updated in $O(\log N)$ time.

B. Insert-leaf and delete-leaf 

We first explain how the one-to-one correspondence between tree leaves and tuples in relation $T$ can be maintained during the insertion or deletion of a leaf.

• Insert: To insert a new leaf, we begin by carrying out the standard insert procedure of a binary search tree, which creates a new leaf node $v$. Let $w$ be the parent of this newly created node. Node $w$, being a leaf prior to the insertion of $v$, represents a single tuple from $T$ and should remain a leaf after the insertion of $v$ as well. This can be achieved by creating a new internal node $u$, which becomes the parent of $v$ and $w$.

• Delete: If the deletion of a node results in an internal node with only one child, we perform a recursive delete on that internal node.

After the insertion or deletion of a leaf node $\ell_i$, we need to update the $M$ and $C$ values at each node along the path of insertion or deletion. This can be achieved by performing the MERGE operation in a bottom-up fashion beginning with $parent(\ell_i)$. If the tree goes out of balance after an insert or delete, the necessary rebalancing may force further re-computation at nodes whose left or right subtree has changed. However, the number of such nodes is bounded by the height ($O(\log N)$) of the tree. Hence the insert-leaf and delete-leaf operations can be done in $O(\log N)$ time.

C. Retrieving Top-k tuples 

In Theorem 3, we proved that the MERGE operation propagates the top-1 tuple $t^1$ to the root node as $top_{root}$. Therefore $t^1$ can be retrieved in constant time. In order to retrieve the top-2 tuple $t^2$, we use the following strategy. After retrieving $t^1$, we set $\Upsilon(t^1) = 0$. As a result, the next highest rank-scored tuple $t^2$ will be propagated as $top_{root}$ instead of $t^1$. This can be achieved by performing the Update-leaf operation on leaf $\ell_j$ (the leaf representing the current $top_{root} = t_j$) with its $m_j$ value set to zero. As $c_j$ remains unchanged, the update affects only the rank-score of $t_j$, leaving the rank-scores of all other tuples unchanged. Repeating the same process, we can retrieve the top-$k$ tuples with the highest rank-score values. We can revert the changes made to the data structure while answering the top-$k$ query by restoring the $m$ values of the $k$ retrieved tuples using the Update-leaf operation.

Top-k
    for i = 1 to k
        t_j = top_root
        report top_root as top-i tuple
        Update-leaf(t_j) with m_j = 0

Figure 2 shows an example of retrieving the top-2 tuple from the uncertain data in Table I.

Theorem 5: The top-$k$ rank-scored tuples can be retrieved in $O(k \log N)$ time.

Proof: For every tuple $t_j$ retrieved for answering a top-$k$ query, we perform the Update-leaf operation twice: once for setting $m_j = 0$ so that the tuple with the next highest rank-score can be retrieved, and once after reporting the top-$k$ answers so as to restore the tree. Since Update-leaf is an $O(\log N)$ time operation, the total time for top-$k$ retrieval is bounded by $O(k \log N)$.
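The retrieval strategy can be sketched with a simplified stand-in for the tree. The class below is an assumption-laden illustration, not the paper's structure: it uses a static, array-backed complete binary tree (so it supports Update-leaf and top-$k$ queries but not Insert-leaf or delete-leaf), pads unused leaves with the neutral triplet $(M, C) = (0, 1)$, and all names are hypothetical. Leaf values are the rounded $m$ and $c$ values for $\alpha = 0.9$ from the running example.

```python
class RankTree:
    """Static array-backed tree of (top, M, C) triplets.

    Leaves hold tuples in descending score order; node u has children 2u and 2u+1.
    """

    def __init__(self, m, c):
        self.m = list(m)
        self.n = len(m)
        self.size = 1
        while self.size < self.n:
            self.size *= 2
        total = 2 * self.size
        self.top = [-1] * total   # -1 marks padding leaves
        self.M = [0.0] * total
        self.C = [1.0] * total
        for i in range(self.n):
            leaf = self.size + i
            self.top[leaf], self.M[leaf], self.C[leaf] = i, m[i], c[i]
        for u in range(self.size - 1, 0, -1):
            self._merge(u)

    def _merge(self, u):
        v, w = 2 * u, 2 * u + 1
        scaled = self.C[v] * self.M[w]
        if self.M[v] > scaled:
            self.top[u], self.M[u] = self.top[v], self.M[v]
        else:
            self.top[u], self.M[u] = self.top[w], scaled
        self.C[u] = self.C[v] * self.C[w]

    def update_leaf(self, i, m_i):
        """Set the m value of leaf i and re-merge along the path to the root."""
        u = self.size + i
        self.M[u] = m_i
        u //= 2
        while u >= 1:
            self._merge(u)
            u //= 2

    def top_k(self, k):
        """Return 0-based indices of the k highest rank-scored tuples."""
        answer = []
        for _ in range(min(k, self.n)):
            j = self.top[1]
            answer.append(j)
            self.update_leaf(j, 0.0)          # hide the reported tuple
        for j in answer:
            self.update_leaf(j, self.m[j])    # restore the tree
        return answer


tree = RankTree([0.300, 0.400, 0.200, 0.521, 0.300, 0.459],
                [0.970, 0.960, 0.980, 0.948, 0.970, 0.954])
print([f"t{j + 1}" for j in tree.top_k(3)])
```

Each reported tuple costs two leaf updates, one to zero its $m$ value and one to restore it, which matches the accounting in the proof of Theorem 5.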
Fig. 2. The data structure after setting $m_4 = 0$ for retrieving Top-2

Fig. 3. The data structure in Fig. 1 after inserting $t^*$

D. Insert-tuple and delete-tuple 

Whenever a tuple $t_i$ is inserted into (deleted from) relation $T$, we modify our data structure as follows:

• We begin by carrying out the Insert-leaf or delete-leaf operation as necessary. If $t_i$ is an independent tuple, then at this point all nodes in the tree $A$ have correct values of $C$ and $M$. Hence no further action is necessary.

• If $t_i$ is not independent, then its insertion (deletion) changes the $m_j$ and $c_j$ values for all leaf nodes corresponding to tuples $t_j$ such that $j > i$ and $\tau(t_i) = \tau(t_j)$. These changes can be accommodated by performing the Update-leaf operation on each $\ell_j$.

Figure 3 shows an example of inserting a new tuple $t^*$ (with $score(t_2) > score(t^*) > score(t_3)$) that is mutually exclusive with ts in the uncertain data of Table I, and Figure 4 shows an example of deleting a tuple.

Thus the insertion (deletion) of a tuple can result in one Insert-leaf or delete-leaf operation and at most $|\tau(t_i)|$ Update-leaf operations. Since any x-tuple can have only a constant number of alternatives, tuple insertion and deletion can be handled in $O(\log N)$ time. We note that updating a tuple can be simulated by first deleting it and then reinserting it with updated values.

We summarize the space requirement and performance of the proposed data structure in the following theorem.

Theorem 6: A collection of uncertain data can be maintained using a linear-size dynamic data structure, which can retrieve the top-$k$ rank-scored tuples in $O(k \log N)$ time and can support the insertion or deletion of a tuple $t$ in $O(d \log N)$ time, where $d$ is the number of tuples that are related to $t$. □
Fig. 4. The data structure in Fig. 1 after deleting ti



VI. Experimental Study 

In this section, we present an experimental study with both synthetic and real data, evaluating the effectiveness of the data structure in handling changes in the underlying database and in answering top-$k$ queries. All experiments were conducted on a 2.4 GHz Intel Core 2 Duo machine with 2 GB memory running Mac OS X 10.6.4.

Datasets: We created a synthetic dataset containing 100,000 tuples. The score of each tuple is chosen uniformly at random from [0, 100000] and its probability is uniformly distributed in (0.5 X 10~^, 1.5 x 10^^). The number of tuples involved in each x-tuple follows the uniform distribution on (2, 10).

Along with the synthetic dataset, we also use the International Ice Patrol (IIP) Iceberg Sighting Database.¹ Each sighting record in the database contains the date, location, number of days the iceberg has drifted, etc. As it is crucial to detect icebergs drifting for long periods, we use the number of days drifted as the ranking score. Each sighting record also contains a confidence-level attribute according to the source of sighting: R/V (radar and visual), VIS (visual only), RAD (radar only), SAT-LOW (low earth orbit satellite), SAT-MED (medium earth orbit satellite), SAT-HIGH (high earth orbit satellite), and EST (estimated). We converted these seven confidence levels into probabilities 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, and 0.4 respectively. We gathered all records from 1981 to 1991 and 1998 to 2004. From these, we created a dataset of 100,000 tuples by repeatedly selecting records at random.

Results: For all of our experiments we choose $\alpha$ = 1 - 0.9^°. We begin by evaluating the query performance of the data structure. We retrieve the top-$k$ tuples from both datasets for $k$ ranging from 10 to 100. The linear dependence of the query time on $k$, as predicted by the time bounds, is evident from the results shown in Figure 5. We also note that correlations among tuples do not affect the query time of our data structure.

Fig. 5. Top-$k$ query performance on real and synthetic data

The next set of experiments shows the efficiency of our data structure in handling tuple insertions and deletions. The time required for inserting and deleting 100 tuples is measured for datasets of varying sizes. Figures 6 and 7 show that the processing time per tuple increases slowly with data size. Whenever a tuple is inserted or deleted, to maintain the correctness of the data structure, we also need to update the information for the leaves corresponding to its related tuples. As all tuples in the real dataset are assumed to be independent, the average insertion/deletion time of a tuple is less than for the synthetic data, which has correlations. This can be seen from the results in Figures 6 and 7. For the synthetic dataset, we insert a tuple such that it is related to existing tuples. We ensure that the probability of the x-tuple into which the new tuple is inserted remains less than 1. For deletion, the victim tuple is selected at random. Figures 6 and 7 also show the effect of varying data size on the query performance of the data structure.

Fig. 6. Processing (insert, delete, top-$k$) cost on real dataset

¹ http://nsidc.org/data/g00807.html 



[Plot: x-axis: number of tuples in data set (10,000 to 100,000); y-axis ticks: 200 to 1800; series: Top-100, Insert-100, Delete-100] 

Fig. 7. Processing (insert, delete, top-k) cost on synthetic dataset 



The data structure proposed in this paper can also be used when 
data arrives in a streaming fashion. Jin et al. [5] have studied the 
problem of answering top-k queries on sliding windows. Our 
data structure achieves performance comparable to the synopses 
proposed by them in terms of handling tuple insertions and 
deletions. Even though our data structure takes linear space, 
whereas these synopses are more space efficient, it should be noted 
that they rely on the random-order stream model used in the 
streaming algorithms community [13], [14], [16] and in the worst 
case would take linear space as well. 



VII. Related Work 

Uncertain data management has attracted a lot of attention 
in recent years due to an increase in the number of application 
domains that naturally generate uncertain data. These 
include sensor networks [17], data cleaning [18] and data 
integration [19], [20]. Several probabilistic data models have 
been proposed to capture data uncertainty (e.g., TRIO [11], 
MYSTIQ [3], MayBMS [6], ORION [1], PrDB 11). Virtually 
all models have adopted possible worlds semantics. Each data 
model captures tuple uncertainty (existence probabilities are 
attached to the tuples of the database), or attribute uncertainty 
(probability distributions are attached to the attributes) or 
both. Further distinction can be made among these models 
based on support for correlations. Most of the work in 
probabilistic databases has either assumed independence or 
supported restricted correlations, mutual exclusion being the 
most common. Recently proposed approaches H, Q extend 
the support for any arbitrary correlations. 

Efforts have been made in recent times to extend the 
semantics of "top-k" to uncertain databases. Soliman et 
al. [10] defined the problem of ranking over uncertain 
databases. They proposed two ranking functions, namely 
U-Topk and U-kRanks, and proposed algorithms 
for each of them. Improved algorithms for the same 
ranking functions were presented later by Yi et al. [12]. 
Hua et al. [4] proposed another top-k definition, PT-k 
(probabilistic threshold queries), and proposed efficient 
solutions. Cormode et al. [2] defined a number of key 
properties satisfied by "top-k" over deterministic data, 
including exact-k, containment, unique-rank, 
value-invariance, and stability. With each of 
the existing top-k definitions lacking one or more of these 
properties, Cormode et al. [2] proposed yet another ranking 
function, expected-rank. As the list of top-k definitions 
continued to grow, Li et al. [8] argued that a single specific 
ranking function may not be appropriate to rank different 
uncertain databases and empirically illustrated the diverse, 
conflicting nature of parameterized ranking functions that 
generalize or can approximate many known ranking functions. 

With most of the work on top-k query processing being 
focused on "one-shot" top-k queries over static uncertain data, 
Chen and Yi [15] were the first to address the dynamic 
aspect of uncertain data. They proposed a fully dynamic data 
structure to support arbitrary insertions and deletions. For an 
uncertain relation with N tuples, the structure of [15] answers 
top-k queries in O(k + log N) time, handles an update in 
O(k log k log N) time and takes O(N) space. However, this 
structure is tied to a single ranking function, namely U-Topk, 
and works only for independent tuples. Moreover, it must be 
built for some fixed value of k and cannot answer a top-j query 
for j > k. The dependence of the update time on k is also 
undesirable. Recently, Jin et al. [5] proposed 
a framework for sliding-window top-k queries on uncertain 
streams supporting several ranking functions. This framework 
assumes the random-order stream model (tuples arrive 
in a random order), which significantly reduces the space 
requirement as compared to the worst-case scenario in which 
any data structure will have to remember every tuple in the 
current window. 

VIII. Conclusions 

In this paper we present a dynamic data structure that 
can retrieve the top-k tuples in O(k log N) time and has an update 
cost of O(log N). We also evaluate the efficiency of the proposed 
data structure with experiments on synthetic and real data. It 
remains an open question whether the top-k retrieval time can be 
improved to O(k + log N) without sacrificing update time, or 
whether there is a lower bound for this problem. 